Integrating image and gene-data with a semi-supervised attention model for prediction of KRAS gene mutation status in non-small cell lung cancer

KRAS is a pathogenic gene frequently implicated in non-small cell lung cancer (NSCLC). However, biopsy as a diagnostic method has practical limitations. Therefore, it is important to accurately determine the mutation status of the KRAS gene non-invasively by combining NSCLC CT images and genetic data for early diagnosis and subsequent targeted therapy of patients. This paper proposes a Semi-supervised Multimodal Multiscale Attention Model (S2MMAM). S2MMAM comprises a Supervised Multilevel Fusion Segmentation Network (SMF-SN) and a Semi-supervised Multimodal Fusion Classification Network (S2MF-CN). S2MMAM facilitates the execution of the classification task by transferring the useful information captured in SMF-SN to the S2MF-CN to improve the model prediction accuracy. In SMF-SN, we propose a Triple Attention-guided Feature Aggregation module for obtaining segmentation features that incorporate high-level semantic abstract features and low-level semantic detail features. Segmentation features provide pre-guidance and key information expansion for S2MF-CN. S2MF-CN shares the encoder and decoder parameters of SMF-SN, which enables S2MF-CN to obtain rich classification features. S2MF-CN uses the proposed Intra and Inter Mutual Guidance Attention Fusion (I2MGAF) module to first guide segmentation and classification feature fusion to extract hidden multi-scale contextual information. I2MGAF then guides the multidimensional fusion of genetic data and CT image data to compensate for the lack of information in single modality data. S2MMAM achieved 83.27% AUC and 81.67% accuracy in predicting KRAS gene mutation status in NSCLC. This method uses medical image CT and genetic data to effectively improve the accuracy of predicting KRAS gene mutation status in NSCLC.


Introduction
Lung cancer is divided into non-small cell lung cancer (NSCLC) and small cell lung cancer. NSCLC accounts for approximately 85% of newly diagnosed lung cancers each year [1]. The emergence of targeted therapy has substantially increased the survival rate of NSCLC patients. Before targeted therapy begins, it should be determined whether important disease-causing genes are mutated. KRAS is a common causative gene in NSCLC, and approximately one-third of patients with NSCLC have KRAS mutations. The usual diagnostic tool is a puncture biopsy. However, this invasive method has many limitations: it is not suitable for all body types and can have unpredictable consequences, such as an increased risk of cancer metastasis [2]. Therefore, there is an urgent need for a non-invasive diagnostic method that can accurately predict KRAS mutations in lung cancer patients. Such a method would not only improve treatment outcomes but also guide prognosis.
In recent years, researchers have used CT images to predict gene mutations based on traditional radiomics and machine learning. Song et al. [3] proposed a machine-learning model for predicting EGFR and KRAS mutation status. They used the model to extract statistical, shape, pathological, and deep learning features from 144 CT scans of tumor regions. Shiri et al. [4] used minimum-redundancy maximum-correlation feature selection and a random forest classifier to build a multivariate model. The model analyzed radiological features extracted from images of tumors and successfully predicted EGFR and KRAS mutation status in cancer patients.
The radiomics and machine learning methods mentioned above have successfully predicted gene mutations. However, most of these methods rely on hand-crafted features. In recent years, deep learning based on convolutional neural networks has attracted much attention in the field of medical image computing. This data-driven approach can automatically extract complex image features [5][6][7]. In addition, imaging genomics holds more promise for deep learning than studies that analyze single-modality data. Imaging genomics is a high-throughput research method that integrates disease imaging data with genomic data and correlates imaging features with genomic features. In recent imaging genomics studies, researchers have proposed a series of deep learning algorithms and theoretical models based on image or genetic data. Dong et al. [8] proposed a multichannel and multitasking deep learning (MMDL) model. They fused radiological features of CT images with clinical information of patients to improve the accuracy with which the model predicts KRAS gene mutations. Hou et al. [9] proposed an attention-based multimodal information fusion module that successfully predicted lymph node metastasis using deep learning features of CT images fused with genetic data. Therefore, machine learning and deep learning-based imaging genomics approaches have great potential for predicting KRAS gene mutation status in NSCLC.
Although the above models achieved considerable performance, challenges remain for deep learning methods that predict KRAS mutation status in NSCLC from image and genetic data: 1) The majority of deep learning methods [8,9] that study classification tasks focus only on classification. These studies did not use the segmentation features generated by a segmentation task to improve the performance of the classification task. Lesion segmentation and classification are two highly related tasks. Segmentation can help remove distractions from CT images and is thus highly beneficial for improving the accuracy of lesion classification. 2) Most of the studied fusion methods rely on simple direct concatenation. They ignore the correlation and difference between medical images and genetic data. This not only leads to ineffective mining of useful semantic features between multi-scale image features and gene features but also fails to make full use of the complementarity of multimodal information. 3) Many studies used models that overemphasized the deep, abstract features of lesions and did not pay sufficient attention to the importance of detailed shallow features for the prediction results. This limits further improvements in accuracy.
To overcome these difficulties and achieve non-invasive and accurate prediction of KRAS gene mutations in NSCLC, we propose a Semi-supervised Multimodal Multiscale Attention Model (S2MMAM) for predicting KRAS gene mutation status in NSCLC. The model uses the Mean Teacher [10] framework as the main structure of the network. Mean Teacher can make full use of labeled images to achieve analytical prediction of unlabeled images, which diminishes the dependence of the network on manual annotation. To compensate for the information lost when the network relies on single-modal unlabeled image data, the model not only lets the Semi-supervised Multimodal Fusion Classification Network (S2MF-CN) share the parameters of the Supervised Multilevel Fusion Segmentation Network (SMF-SN) to enrich the key information of the lesion, but also multimodally fuses the patient's genetic data with the image data to expand the mutation knowledge. Specifically, SMF-SN includes a new Triple Attention-guided Feature Aggregation (TAFA) module. It aims to adaptively fuse high-level semantic features with low-level semantic features using an attention-guided mechanism. TAFA can suppress background noise and localize the extraction of key lesion features. In S2MF-CN, we propose an Intra and Inter Mutual Guidance Attention Fusion (I2MGAF) module to guide the fusion of intra-information and of inter-information in a staged manner. I2MGAF can effectively extract complementary information from different modalities at different scales to improve classification performance.
In contrast to previous KRAS mutation prediction studies based on conventional radiomics and machine learning [3,4], we use a convolutional neural network for CT image feature extraction. This technique is more efficient and reduces the cost of manual annotation. Moreover, it opens the prospect of end-to-end applications. Studies [5][6][7][8][9] that made predictions for other diseases in multimodal classification tasks used simple multimodal fusion methods. In contrast, our proposed method focuses on extracting different dimensions of information from different modal data to achieve complementary fusion.
The contributions of this paper are as follows:

Mean Teacher in semi-supervised learning
Semi-supervised learning has been studied in the medical imaging community for a long time [11,12]. It can reduce the human workload of labeling data. Current research has shown its potential to improve network performance when labels are scarce. There are three semi-supervised models based on the principle of consistency: the P-Model [13], Temporal Ensembling (TE) [13], and the Mean Teacher model. To show the advantages and disadvantages of these three consistency-based semi-supervised methods more succinctly, we summarize them in Table 1, which allows a more precise comparison of the three approaches.
In recent years, Mean Teacher has achieved good results as a basic framework in semi-supervised classification tasks. Wang et al. [14] successfully identified diabetic macular edema based on the Mean Teacher model using a small amount of roughly labeled data and a large amount of unlabeled data. Liu et al. [15] used a Mean Teacher-based network model to successfully achieve skin lesion diagnosis on the ISIC 2018 challenge and thorax disease classification on ChestX-ray14. Wang et al. [16] proposed a model that unifies diverse knowledge into a generic knowledge distillation framework for skin disease classification. It enables the student model to acquire richer knowledge from the teacher model. These models demonstrate that Mean Teacher achieves excellent results in semi-supervised classification tasks, so we use it as the basic framework for our S2MMAM.
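To make the Mean Teacher idea concrete, the following minimal sketch shows its two generic ingredients: the teacher weights are an exponential moving average (EMA) of the student weights, and a consistency loss penalizes disagreement between student and teacher predictions. The parameter names, the smoothing factor `alpha`, and the squared-error form of the loss are illustrative assumptions, not settings reported for S2MMAM.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return teacher weights after one exponential-moving-average step."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    pairs = list(zip(student_probs, teacher_probs))
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)


# Toy weights: two scalar "parameters" per model (hypothetical names).
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 0.0}

# The teacher moves a small step towards the student after each update.
teacher = ema_update(teacher, student, alpha=0.9)

# Consistency term on an unlabeled sample (made-up class probabilities).
loss = consistency_loss([0.7, 0.3], [0.6, 0.4])
```

Because the consistency term needs no labels, it can be evaluated on unlabeled images, which is what lets this family of methods exploit them.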

Segmentation facilitates classification
Using segmentation tasks to facilitate classification is a basic form of multitask learning [17]. In multitask learning, a segmentation task associated with the classification task can assist the classification task in learning its target, thus improving classification performance [18]. The same idea can be borrowed for a single-task classification model. The information captured by the segmentation branch of the model can be transferred to the classification model to expand the lesion information. The supervised segmentation task is trained using mask-labeled data. The aim is to obtain the most comprehensive high-level semantic features of the target region and reduce the learning of noisy backgrounds. Rich segmentation features help the classification task learn richer semantic information. Thus, a supervised segmentation network can assist the classification task by suppressing the background noise that arises in semi-supervised classification networks from missing physician labels, thereby improving classification accuracy.
According to Table 2, the above works demonstrate that segmentation has a facilitating effect on classification. However, they share a common problem: they all study supervised models, which have high data labeling costs. We believe that the combination of segmentation and classification tasks can make the network more informative. Therefore, our research aims to combine the idea of segmentation facilitating classification with semi-supervised models. We combine two related tasks: NSCLC lesion segmentation and KRAS gene mutation status prediction. S2MMAM allows S2MF-CN to obtain the key features of lesions upon initialization through the strategy of sharing network parameters between SMF-SN and S2MF-CN. In S2MF-CN, the segmentation features are guided to merge with the classification features to obtain the key features. This strategy can enrich the lesion information and improve the classification performance of the network.

Multiscale features and attention learning
Traditional convolution operations mostly focus on extracting local features. However, because local features contain limited information, the model cannot learn the full content of the region of interest well. Multi-scale features contain local features of multiple regions of interest. The extracted local features are fused with other operations to obtain comprehensive information about the target, which helps the network model to learn. To extract multi-scale features, the Atrous Spatial Pyramid Pooling (ASPP) module [21] captures contextual information by applying atrous convolutions with different dilation rates to the target region. In the medical image domain, the PSE [22] module uses a patch-level pyramid design to extend SE operations to multiple scales, allowing the network to adaptively focus on vessels of variable width. The Scale-aware Feature Aggregation (SFA) module [23] effectively extracts hidden multi-scale contextual information and aggregates multi-scale features to improve the model's ability to handle complex vasculature. The Convolutional Block Attention Module (CBAM) [24] introduces channel and spatial attention. It extracts key feature information from both dimensions to enrich the network content. In the medical image application domain, the Context-assisted full Attention Network (CAN) [25] combines Non-Local Attention (NLA), Channel Attention (CA), and Dual-pathway Spatial Attention (DSA) to extract lesion information in multiple directions.
Currently, it is widely believed that both multi-scale features and attention mechanisms can help models enhance the recognition of feature maps from different dimensions. However, the above papers share a common problem: they do not combine multi-scale features with attention mechanisms. Therefore, we combine these two techniques and design the TAFA module. On the one hand, we fuse high- and low-dimensional segmentation features to obtain abstract and detailed information. On the other hand, we fuse segmentation and classification features of different levels to guide the features to learn key factors adaptively and enhance the ability of the network to capture lesions. Thus, the predictive capability of the model is improved.

Overview
In the NSCLC dataset, each patient corresponds to a set of CT images and gene data (Section Dataset). Specifically, in our problem setting, we are given a training set containing N labeled data and M unlabeled data, where N << M. Let the labeled training dataset be denoted by $S_L = \{(X_i^L, Y_i^L)\}_{i=1}^{N}$ and $C_L = \{(X_i^L, Z_i^L)\}_{i=1}^{N}$, where $S_L$ represents the dataset for segmentation, $C_L$ represents the dataset for classification, $X_i^L$ represents the i-th labeled CT image, $Y_i^L$ represents the pixel-level annotation of $X_i^L$, and $Z_i^L$ represents whether the KRAS gene is mutated, with $Z_i^L \in \{0, 1\}$, where 0 means negative and 1 means positive. Let the unlabeled training dataset be denoted by $\{X_i^U\}_{i=1}^{M}$, where $X_i^U$ represents the i-th unlabeled image. The entire model pipeline can be summarized as follows. First, we pre-train SMF-SN on $S_L$ to train the network's ability to capture lesion regions. This helps suppress problems such as the heavy noise in CT images and supports classification. Meanwhile, the network body of S2MF-CN shares encoder and decoder parameters with SMF-SN. Therefore, the encoder and decoder of S2MF-CN are also initialized in this step, and practical segmentation features for different levels of lesions are obtained. The classification network in S2MF-CN can then capture the key classification features of lesions using these segmentation features. Finally, after S2MF-CN fuses segmentation, classification, and genetic data features, the semi-supervised Student Model is trained to determine patients' KRAS gene mutation status accurately.

Supervised multilevel fusion segmentation network
The architecture of SMF-SN. This section introduces a supervised segmentation network based on multidimensional feature fusion. SMF-SN can precisely localize lesion edges and internal regions and greatly reduce the impact of image background noise on network performance. SMF-SN mainly utilizes our proposed SE-ResNeXt and TAFA modules.
We use the augmented segmentation training dataset $S_L$ to train SMF-SN to obtain rich segmentation features. The obtained segmentation features provide the semi-supervised classification network with a priori information about the lesion location. This improves the classification network's ability to localize and identify lesions.
As shown in Fig 2, SMF-SN includes a stem block, three encoder blocks, three TAFA blocks, a bridge block, three decoder blocks, and an output block.
In the encoder, each encoder block is composed of an SE-ResNeXt and a max-pooling layer with stride 2. As shown in Fig 3, SE-ResNeXt improves ResNeXt with SENet. ResNeXt aggregates a set of transformations with the same topology by repeating a building block. SENet performs feature learning on the aggregated features in the channel dimension to estimate the importance of each channel. SE-ResNeXt can thus enhance the network in both the channel and spatial dimensions to capture richer segmentation features. The max-pooling layer halves the spatial dimensions of the feature map to reduce the computational cost. The output of the encoder is passed through a bridge consisting of SE-ResNeXt and Atrous Spatial Pyramid Pooling (ASPP). The bridge provides the largest receptive field for TAFA so that it covers a wider range of contextual information, facilitating more efficient integration between multiple levels. Between high-level and low-level semantics, we use the proposed TAFA module, which utilizes multi-scale and attention fusion mechanisms. The module suppresses irrelevant low-level background noise and lets the two levels complement each other with contextual difference information, preserving more detailed local semantic information and learning lesion information better. The TAFA module is described in detail in Section Triple Attention-guided Feature Aggregation.
Triple attention-guided feature aggregation. CT images of lung nodules may contain a large amount of noise: for example, grayscale values of different lung tissues overlap, and boundaries are blurred and hard to distinguish. High-level features of the decoder and low-level features of the encoder are both crucial for capturing lesion features. However, most existing UNet-based connection methods directly connect shallow and deep semantic features of different scales. This ignores that high-level features contain rich semantic information that can help low-level features identify semantically important locations. Likewise, low-level features contain rich spatial information that can help high-level features reconstruct accurate details.
Considering the above factors, we design a Triple Attention-guided Feature Aggregation (TAFA) module to guide the fusion between high- and low-dimensional features. TAFA can guide different layers to extract key feature information individually and then fuse them after retaining the domain-invariant key information, as shown in Fig 4. In the TAFA module, we first upsample the high-dimensional feature $F^{i+1}_{high}$ to have the same size as the low-dimensional feature $F^{i}_{low}$ ($i \in \{1, 2, 3\}$). After that, we concatenate the high- and low-dimensional features along the channel dimension to obtain $F^{i}_{C}$:

$$F^{i}_{C} = \mathrm{Concat}\left(f_{up}\left(F^{i+1}_{high}\right),\ F^{i}_{low}\right)$$
Here, Concat represents the concatenation operation and $f_{up}$ represents the up-sampling operation. Then, to better mine the most useful feature channels between different levels, we introduce a scale channel attention-aware mechanism that automatically selects the appropriate receptive field for the feature map and suppresses the interference of irrelevant background noise. We feed the concatenated feature $F^{i}_{C}$ into global average pooling (GAP) and global max pooling (GMP), respectively. TAFA uses the GAP layer to excite the feature channel information and the GMP layer to retain the maximum semantic information. Afterward, the corresponding feature maps $F^{i}_{AM}$ and $F^{i}_{MM}$ are obtained using a multi-layer perceptron (MLP) with shared parameters. The feature maps $F^{i}_{AM}$ and $F^{i}_{MM}$ are summed, and the sum passes through a sigmoid function to generate a global guidance feature coefficient $W_{global}$:

$$F^{i}_{AM} = f_{mlp}\left(f_{gap}\left(F^{i}_{C}\right)\right), \qquad F^{i}_{MM} = f_{mlp}\left(f_{gmp}\left(F^{i}_{C}\right)\right)$$

$$W_{global} = f_{\sigma}\left(F^{i}_{AM} + F^{i}_{MM}\right)$$
Here, $f_{\sigma}$ represents the sigmoid activation, $f_{mlp}$ the MLP operator, $f_{gap}$ global average pooling, and $f_{gmp}$ global max pooling. In addition, the high- and low-level semantic information $F^{i}_{AM}$ and $F^{i}_{MM}$ is used as guidance: it is combined with the high- and low-dimensional features, respectively, and after the attention operation we obtain the high-level guided semantic feature $F^{i+1}_{high\_att}$ and the low-level guided semantic feature $F^{i}_{low\_att}$.
Finally, the weighted features are concatenated, and the concatenated feature maps are multiplied by $W_{global}$. A 1×1 convolutional layer then captures domain-invariant information while reducing the dimensionality, which yields the final fused feature $F^{i}$.
Our proposed TAFA transfers features from shallower convolutional layers to deeper convolutional layers. Reusing the shallow features in the deeper convolutional layers prevents them from being forgotten and gives the obtained features a stronger representational ability. By gradually guiding the fusion between high- and low-level features, SMF-SN adaptively combines high- and low-dimensional semantic information, reassigns feature weights, and better captures critical domain-invariant information. Thus, lung nodules can be separated from the noise.
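The channel-gating step of TAFA can be illustrated with a small NumPy sketch. It covers only the part described explicitly in the text: upsampling, channel concatenation, GAP and GMP descriptors, a shared MLP, and the sigmoid that yields the global coefficient. The nearest-neighbour upsampling, the two-layer MLP with a reduction ratio of 2, the random weights, and the tensor shapes are all assumptions made for illustration. The per-branch attention and the final 1×1 convolution are omitted, so the gate is applied directly to the concatenated feature.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def global_gate(f_high, f_low, w1, w2):
    """Return the concatenated feature and its channel gate W_global."""
    f_c = np.concatenate([upsample_nearest(f_high), f_low], axis=0)
    gap = f_c.mean(axis=(1, 2))   # global average pooling -> (C,)
    gmp = f_c.max(axis=(1, 2))    # global max pooling -> (C,)

    def mlp(v):
        # Shared two-layer MLP with a ReLU in between (assumed form).
        return w2 @ np.maximum(w1 @ v, 0.0)

    w_global = sigmoid(mlp(gap) + mlp(gmp))
    return f_c, w_global


rng = np.random.default_rng(0)
f_high = rng.normal(size=(8, 4, 4))   # deeper, coarser feature (toy shape)
f_low = rng.normal(size=(4, 8, 8))    # shallower, finer feature (toy shape)
c = f_high.shape[0] + f_low.shape[0]
w1 = rng.normal(size=(c // 2, c))     # hypothetical MLP weights
w2 = rng.normal(size=(c, c // 2))

f_c, w_global = global_gate(f_high, f_low, w1, w2)
gated = f_c * w_global[:, None, None]  # channel-wise reweighting
```

Each entry of `w_global` lies between 0 and 1, so the gate can only attenuate a channel, which matches the stated goal of suppressing channels dominated by background noise.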

Intra Fusion Component (IntraFC)
We propose the IntraFC based on the MultiRes Block, which can capture multi-scale information [26]. We adopt a strategy of fusing classification features with segmentation features at each level, so that the information favoring the prediction of KRAS gene mutation status is jointly retained. The specific structure of the IntraFC component is shown in Fig 5(B). The final-level segmentation features $F^{3}_{S}$ are subjected to convolutional operations to obtain the initial classification features $F_{C}$. Because of the inductive bias inherent in the convolution mechanism, the key features of the lesion are easily lost after multiple convolutions. Therefore, we fuse the earlier segmentation features with the current classification features to compensate for the bias introduced by the deep network. First, we reshape the segmentation features $F^{i}_{S}$ ($i \in \{1, 2, 3\}$) until their size C×W×H matches that of the classification feature $F_{C}$. Then, a 3×3 convolution is applied to the segmentation features and to the classification features. Before each subsequent convolution, we introduce the convolutional features from the previous stage together with the initial fusion feature $F^{i}_{SC}$. This effectively models the correlation between segmentation and classification features. It ensures that the features from the shallow convolutional layers of segmentation and classification are better transferred to the deeper layers. The final fused result $F^{i}_{Intra}$ ($i \in \{1, 2, 3\}$) is obtained after several feature fusions.

Inter Fusion Component (InterFC)
We propose the InterFC to find the bidirectional mapping relationship between lung cancer image features and causative genes from the sagittal view (x-axis), coronal view (y-axis), and axial view (z-axis), respectively.InterFC can adaptively enhance the necessary information in different modal features, allowing a more adequate fusion of multimodal features.
The specific structure of the InterFC component is shown in Fig 5(C). The initial classification feature $F_{C}$, the fusion result $F^{i}_{Intra}$ ($i \in \{1, 2, 3\}$) output by IntraFC, and the processed genetic data $G$ are first concatenated to obtain the multimodal fusion feature $M_{C}$. After that, $M_{C}$ is delivered to InterFC to further model the importance of each modality.

$$M_{C} = \mathrm{Concat}\left(F_{C},\ F^{i}_{Intra},\ G\right)$$
Here, Concat denotes the concatenation operation. The concatenated multimodal features are then fed to three convolutional layers with BN and ReLU. The convolution kernel sizes are 1×3×1, 3×1×1, and 1×1×3, respectively, which produces three feature maps $Query \in \mathbb{R}^{C \times H \times W}$, $Key \in \mathbb{R}^{C \times H \times W}$, and $Value \in \mathbb{R}^{C \times H \times W}$ (where C, H, and W indicate the channel, height, and width of the input features, respectively). We first transpose the Query feature. Then, we apply a softmax layer to the matrix product of $Query^{T}$ and Key to encode the feature relationships in the sagittal and coronal views. Finally, the resulting attention map is multiplied with Value to obtain the voxel-level attention-enhanced fusion features $F_{Inter}$, which are then reshaped to $\mathbb{R}^{C \times H \times W}$.
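The attention step above (softmax over the product of the transposed Query and the Key, followed by multiplication with Value) can be sketched in NumPy as follows. The sketch assumes that Query, Key, and Value have already been produced by the three directional convolutions, which are not modelled here, and that the spatial positions are flattened before the matrix products. The toy shapes and random inputs are placeholders.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_fuse(query, key, value):
    """Fuse (C, H, W) maps with softmax(Q^T K) attention over positions."""
    c, h, w = query.shape
    q = query.reshape(c, h * w)        # (C, N)
    k = key.reshape(c, h * w)          # (C, N)
    v = value.reshape(c, h * w)        # (C, N)
    attn = softmax(q.T @ k, axis=-1)   # (N, N), each row sums to 1
    fused = v @ attn.T                 # weighted sum of Value over positions
    return fused.reshape(c, h, w), attn


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 3, 3)) for _ in range(3))
f_inter, attn = attention_fuse(q, k, v)
```

Every output position is a convex combination of the Value vectors at all positions, so information from distant locations, and hence from different modalities after concatenation, can influence each fused feature.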

Dataset
In this study, we used NSCLC-Radiogenomics [27], which is directly accessible on The Cancer Imaging Archive (TCIA) website. NSCLC-Radiogenomics is a public dataset, and the study of the patients involved has been ethically approved. Users can download the relevant data for research and publication free of charge. Our study is based on open-source data and is therefore free from ethical issues and other conflicts of interest. NSCLC-Radiogenomics is a unique radiogenomic dataset from a cohort of 211 NSCLC subjects. The imaging data mainly include CT scans, semantic annotations of tumors observed on CT images using a controlled vocabulary, and segmentation maps of tumor lesions (lung nodules) on CT scans; the genetic data mainly include RNA sequencing (RNA-seq) data. Patients were excluded from the training and testing datasets for 1) lack of RNA-seq data, 2) lack of CT images, or 3) lack of physician-annotated segmentation maps of CT lesions. After screening, the number of cases with complete images and genetic data was 124. Of the 124 patients, 94 were of the wild type, and 30 were of the mutation type. The clinical information of these patients is shown in Table 3. All data were randomly divided into training and test datasets in a 4:1 ratio.
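As an illustration of a random 4:1 division, the sketch below shuffles patient identifiers and cuts the list at 80%. Splitting at the patient level, the seed, and the identifier format are assumptions made for this example; the text states only that the data were divided randomly in a 4:1 ratio.

```python
import random


def split_patients(patient_ids, train_fraction=0.8, seed=0):
    """Shuffle patient IDs and split them into train and test lists."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


patients = [f"case-{i:03d}" for i in range(124)]  # hypothetical IDs
train_ids, test_ids = split_patients(patients)
# With 124 patients this gives 99 training and 25 test patients.
```

Splitting by patient rather than by image keeps all slices of one patient on the same side of the split, which avoids leaking patient-specific appearance into the test set.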

Data preprocessing
CT image. Inspired by Cubuk et al. [28], we apply the simple procedure of AutoAugment to the 124 sets of CT images to automatically search for improved data augmentation strategies. The search space is designed so that a strategy consists of many sub-strategies, and one sub-strategy is randomly selected for each image in each mini-batch. Each sub-strategy contains two operations; each operation is an image processing function (such as cropping) together with the probability and magnitude with which that function is applied. In this way, we obtained 6696 images with a fixed size of 512×512.
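The structure of such a strategy can be sketched in plain Python: a policy is a list of sub-strategies, each holding two operations with a probability and a magnitude, and one sub-strategy is drawn at random per image. The two operations and the numeric values below are hand-written placeholders for illustration; they are not the policy found by the search.

```python
import random


def flip_horizontal(img, magnitude):
    """Mirror each row; magnitude is unused for this operation."""
    return [row[::-1] for row in img]


def brighten(img, magnitude):
    """Add magnitude to every pixel, clipped to the 8-bit range."""
    return [[min(255, max(0, p + magnitude)) for p in row] for row in img]


# Each sub-strategy holds two (operation, probability, magnitude) triples.
POLICY = [
    [(flip_horizontal, 0.5, 0), (brighten, 0.8, 20)],
    [(brighten, 0.3, -15), (flip_horizontal, 1.0, 0)],
]


def augment(img, rng):
    """Pick one sub-strategy at random and apply its operations in order."""
    for op, prob, magnitude in rng.choice(POLICY):
        if rng.random() < prob:
            img = op(img, magnitude)
    return img


rng = random.Random(0)
image = [[0, 50, 100], [150, 200, 250]]  # toy 2x3 grayscale image
augmented = augment(image, rng)
```

Because each operation fires only with its own probability, the same sub-strategy can produce several different outputs for one image, which is what multiplies the effective size of the training set.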
Gene selection. The gene expression data used in this study are RNA-seq data. Since the gene dataset contains more than 20,000 gene expression values per patient, this huge amount of data can significantly increase the computational cost and decrease the prediction accuracy. Therefore, before training the model, we screened the RNA-seq gene expression data with a feature selection algorithm [29] to retain the genes most relevant to KRAS mutations. A total of 115 relevant genes were finally selected. The selected genes were fed into an MLP to obtain effective gene features, which maps the high-dimensional gene data to a low-dimensional space.

Implementation details
Our model S2MMAM is divided into SMF-SN and S2MF-CN. The labeled image data applied to SMF-SN is 30% of the total dataset, about 2100 images. The training dataset applied to S2MF-CN consists of 30% labeled data and 70% unlabeled data. Our experiments were mainly run on 2 NVIDIA RTX A5000 GPUs with 64 GB of memory. All models in the experiments are trained using 10-fold cross-validation. The specific initialization network configurations are shown in Table 4.

Evaluation metrics
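To quantitatively analyze the experimental results, we used six performance metrics to evaluate the classification results: Accuracy (AC), Recall, Precision, Specificity (SP), Area Under the receiver operating characteristic Curve (AUC), and F1 score (F1). They are defined as follows:

$$AC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Recall = \frac{TP}{TP + FN}, \qquad Precision = \frac{TP}{TP + FP}$$

$$SP = \frac{TN}{TN + FP}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

$$AUC = \int_{0}^{1} t_{pr}\left(f_{pr}\right)\, d f_{pr} = P\left(X_{1} > X_{0}\right)$$

where TP is true positive, TN is true negative, FP is false positive, FN is false negative, $t_{pr}$ is the true positive rate, $f_{pr}$ is the false positive rate, and $X_{1}$ and $X_{0}$ are the confidence scores for positive and negative instances, respectively.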
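The standard definitions of these metrics can be sketched in a few lines of Python. The threshold-based metrics are computed from confusion-matrix counts, and the AUC is estimated as the fraction of positive/negative pairs in which the positive instance receives the higher score. The counts and scores in the example are made-up values, and counting ties as one half is a common convention assumed here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


def auc(pos_scores, neg_scores):
    """Estimate P(X1 > X0) over all positive/negative pairs; ties count 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


m = classification_metrics(tp=8, tn=6, fp=2, fn=4)
a = auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

On an imbalanced cohort, accuracy alone can be misleading, which is why recall, specificity, and AUC are reported alongside it.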

Ablation studies
In this section, we evaluate the impact of SE-ResNeXt, the TAFA module, and the I2MGAF module on our S2MMAM, respectively.
Ablation study of SE-ResNeXt. Using SE-ResNeXt as the backbone of the network not only enhances the network's ability to extract lesion features but also takes advantage of the lightweight design of ResNeXt to reduce the computational burden and improve the network's efficiency. To verify the performance of our proposed SE-ResNeXt, we replace the backbone network with UNet, ResNet, ResNeXt, and Inception-V3, yielding S2MMAM(UNet), S2MMAM(ResNet), S2MMAM(ResNeXt), and S2MMAM(Inception-V3), respectively. These variants are compared with our proposed SE-ResNeXt on the same dataset. The results are shown in Table 5.
As shown in Table 5, our S2MMAM(Ours) performed the best in KRAS gene mutation prediction among the five models, achieving the best results in all six metrics. The AUC was 83.27%, 5.96% higher than the second-place S2MMAM(ResNeXt). Compared to the more popular S2MMAM(Inception-V3), the AUC was 6.43% higher. SE-ResNeXt has a simpler architecture and lower computational complexity than Inception-V3. SE-ResNeXt effectively eliminates the semantic differences between features by utilizing multi-scale and attention mechanisms. This enables SE-ResNeXt to outperform the other traditional networks trained on the same data and helps the model to better localize the lesion area.
Ablation study of TAFA module. Using TAFA as the basic module to build S2MMAM better captures the key and complementary information of high-level and low-level semantic features. It further enhances the feature representation capability, improves the quality of the extracted segmentation features, and promotes classification performance. To validate the performance of our proposed TAFA, we compare our proposed S2MMAM(Ours) with Addition, Concatenation, Adaptive Enhanced Attention Fusion (AEAF) [34], and the Adaptive Spatiotemporal Semantic Calibration Module (ASSCM) [35] on the test dataset. The results are shown in Table 6.
The results show that the highest performance metrics on the classification task were achieved by our proposed S2MMAM built with TAFA. TAFA (Ours) not only obtained the highest AUC value of 83.27% compared to the other four models but also achieved the best results on the other five classification performance metrics, with a maximum AC of 81.67% and a maximum SP of 82.66%. The AUC is 4.39% higher than that of the second-place AEAF, indicating that TAFA can effectively fuse multi-scale information. This suggests that our model S2MMAM can detect more patients and effectively reduce the underdiagnosis rate. TAFA achieved an F1 score of 82.73%, which exceeds AEAF by 4.46% and ASSCM by 4.3%. This demonstrates that our TAFA has a more stable classification performance and better classification ability.
Ablation study of I2MGAF module. The I2MGAF module was implemented to guide the fusion of features in segmentation and classification tasks, as well as the fusion of image features with genetic data. To demonstrate that the I2MGAF module can better guide the fusion of multimodal and multiscale features in the model, we replaced the IntraFC module in I2MGAF with Addition, Concatenation, and Adaptive Feature Fusion (AFF Block) [23], respectively. The InterFC module was replaced with Group Feature Learning (GFL Block) [36] and Non-Local Attention (NLA Block) [25], respectively. The five resulting models are compared with I2MGAF on the classification test dataset. The results are shown in Figs 6 and 7. From Fig 6, we find that the Concatenation fusion method achieves the lowest AUC value, so the simple fusion used in [5][6][7][8][9] cannot take full advantage of the multimodal information. AFF Block is 5.9% lower than our IntraFC in AUC. This is because AFF Block only focuses on inter-channel fusion of features at different levels, ignoring the potential loss of information due to network depth. Our IntraFC module not only focuses on channel fusion of segmentation and classification features but also addresses the information loss caused by multiple fusions.
Fig 7 compares the six classification performance metrics after replacing the InterFC module in I2MGAF with the GFL Block and NLA Block, respectively. Our InterFC outperforms the second-place NLA Block by 4.11% in AUC and 3.6% in F1 score. Our InterFC overcomes the limitation that NLA Block only fuses information in a single dimension. InterFC can fully combine information in three dimensions to fuse data of different modalities and improve model sensitivity, thus obtaining a better prediction of KRAS mutation.
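To make the two simplest baseline fusion strategies in this ablation concrete, the sketch below contrasts element-wise Addition with channel Concatenation on two feature maps. It is an illustration only: the tensor shapes and the names `seg_feat` and `cls_feat` are assumptions, and the code does not reproduce IntraFC, InterFC, AFF Block, GFL Block, or NLA Block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps in (batch, channels, height, width) layout.
seg_feat = rng.standard_normal((1, 64, 16, 16))  # segmentation-branch features
cls_feat = rng.standard_normal((1, 64, 16, 16))  # classification-branch features

# Addition: element-wise sum; requires identical shapes, keeps 64 channels.
fused_add = seg_feat + cls_feat

# Concatenation: stack along the channel axis; yields 128 channels.
fused_cat = np.concatenate([seg_feat, cls_feat], axis=1)
```

Neither operation weights the inputs by their relevance; both treat every channel and spatial position equally, which is the limitation that attention-guided fusion modules are designed to address.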

Comparison experiment
We compare the proposed S2MMAM with classical semi-supervised learning (SSL) methods and with recently published, better-performing SSL image classification models, each trained with 100% and 30% labeled data, respectively. The classical SSL methods include Π-Model [13] and Mean Teacher. The competing methods include Relation-driven Self-ensembling Model (RSM) [15], SS-TBN [37], and DAB [38]. Note that we reproduce the above methods on the same test set for the sake of fairness.
Table 7 shows that S2MMAM outperforms the other models on the key evaluation metrics with both 100% and 30% labeled data. This means that our S2MMAM can be used not only for supervised training but also for semi-supervised applications. We use the fully supervised model trained with 100% labeled data as the upper bound and the SSL model trained with 30% labeled data as the target model. As can be seen from Table 7, S2MMAM (Ours) achieved an AUC of 83.27% on the 30% labeled dataset, whereas Mean Teacher only obtains an AUC of 80.04% on the 100% labeled dataset. This shows the superiority of our S2MMAM for the classification task, which achieves accurate prediction even at a lower labeling cost. Compared with other models, our S2MMAM has the smallest AUC gap between the 30% labeled dataset and the upper bound, at only 4.65%. This result indicates that our TAFA module and I2MGAF module effectively fuse the key multi-scale, multi-modality features. They can mitigate the problem of feature disappearance caused by deep convolution and re-establish the fusion of high- and low-dimensional semantic key features. Compared with other SSL models that use only CT images for classification, our model has an AUC 6.9% higher than the second best

Superiority of the model
Although the ablation studies and comparison experiments have demonstrated the merits of our proposed method, further discussion is needed on 1) the positive effects of segmentation features for the classification task, 2) the superiority of multimodal data over single modal data, and 3) the selection of the proportion of labeled images within the training dataset. We designed three sets of experiments and empirically used training datasets in which the proportion of labeled data was 100%, 40%, and 30%. Baseline is our base architecture; it is constructed only from S2MF-CN and uses only CT image data for the classification task. On this basis, we conducted a comparative study by adding SMF-SN, genetic data, and both SMF-SN and genetic data in turn. The experimental results are shown in Table 8.

1) The positive effects of segmentation features for the classification task
As shown in Table 8, better classification results are obtained when the model uses segmentation to facilitate classification. Compared to Baseline, Baseline+SMF-SN improves the AUC values by 6.03%, 3.62%, and 4.11% on the 30%, 40%, and 100% labeled datasets, respectively. We also visualize some of our Baseline and Baseline+SMF-SN segmentation results in Fig 9. The results are output in the form of a segmentation map, which visualizes the ability of the network to localize the lesion area. As can be seen from Fig 9, the model with the segmentation task can better localize the lesion area. It avoids mixing in irrelevant regions that can easily interfere with the judgment, thereby improving diagnostic accuracy.
2) The superiority of multimodal data over single modal data
As shown in Table 8, when we used genetic data, the AUC improved by 3.94%, 2.41%, and 2.81%, respectively, compared with Baseline. This indicates that, in addition to the image data, the network can extract from the biological data genotypic features that express individual differences and reflect disease characteristics at the micro level. This further enriches the information available to the network and improves the classification performance.
3) The selection of the proportion of labeled images within the training dataset
As shown in Table 8, when the proportion of labeled data was 30% and 40%, respectively, the difference in the values of the four metrics was small, with a 0.71% difference in AUC and a 0.83% difference in Recall. Compared with the cost of physician labeling, this result indicates that the guidance information contained in 30% labeled training images is sufficient for the network to learn the key information of the lesion. Therefore, we used 30% labeled images and 70% unlabeled images as the training ratio of the model. To show the classification performance of our S2MMAM more visually, we also plotted the 3D comparison histograms of AUC and F1 score, as shown in Figs 10 and 11.
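The 30% labeled / 70% unlabeled training ratio described above can be illustrated with a short sketch. This is only one possible way to partition the sample identifiers, not the authors' data pipeline: the function name `split_labeled`, the fixed seed, and the use of integer sample IDs are assumptions made for the example.

```python
import random


def split_labeled(sample_ids, labeled_fraction=0.3, seed=0):
    """Randomly partition sample IDs into a labeled and an unlabeled subset."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n_labeled = round(len(ids) * labeled_fraction)
    return ids[:n_labeled], ids[n_labeled:]


# Hypothetical training set of 100 samples.
labeled, unlabeled = split_labeled(range(100), labeled_fraction=0.3)
```

In a semi-supervised setting, only the `labeled` subset contributes to the supervised loss, while both subsets can contribute to an unsupervised consistency term.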
In summary, sharing the segmentation network parameters with the classification network helps the network localize the lesion region more accurately. The complementary nature of multimodal data allows the network to learn more abstract features and also addresses the challenge of limited information in semi-supervised strategies. Therefore, our S2MMAM is better able to preserve the pathogenic regions, ignore irrelevant information, and improve model sensitivity. This leads to better KRAS mutation prediction results for NSCLC.

Performance in supervised learning
To demonstrate the scalability of our model, we extend its application scenario from semi-supervised learning to supervised learning. We compare our S2MMAM with current, better-performing multimodal classification models. The competing methods include the Multimodal Feature Fusion Diagnostic Model (MFFDM) [39] and PLNM [9]. Note that we reproduce the above methods on the same test set for the sake of fairness.
As shown in Table 9, our S2MMAM achieved the best AC, SP, and AUC values. This shows that our model has excellent classification performance even in supervised learning applications. Its AUC is 1.6% higher than that of the second-place PLNM and 3.75% higher than that of MFFDM. MFFDM fuses features by simple concatenation, which we believe is the reason for its poorer classification performance. Our S2MMAM employs multidimensional fusion, which allows it to adaptively fuse complementary information. Our S2MMAM and PLNM are similar in classification performance, but our method achieves a better AUC value. We believe that SSL models can use limited information to achieve accurate prediction. When trained with more labeled data, our S2MMAM is better able to extract and integrate information. In summary, our S2MMAM can be used not only in SSL but also in supervised learning. It is a non-invasive method for determining whether the KRAS gene is mutated, which helps to determine the treatment for patients early and to improve their survival rate.

Conclusion
In this paper, we propose a method that integrates image and gene data with a semi-supervised attention model for the prediction of KRAS gene mutation status in non-small cell lung cancer. The model consists of two components: a Supervised Multilevel Fusion Segmentation Network (SMF-SN) and a Semi-supervised Multimodal Fusion Classification Network (S2MF-CN). The results on the NSCLC-Radiogenomics dataset demonstrate that S2MMAM can achieve a more accurate prediction of KRAS gene mutation status.
However, our S2MMAM still has some limitations. First, the model was tested on a single dataset and was not evaluated on multiple different datasets. Second, although CT images have been shown to aid in the prediction of KRAS gene mutations, histopathology images are the gold standard in the clinical setting. We will try to combine CT images, histopathology images, and genetic data to further improve the accuracy of KRAS gene mutation status prediction in non-small cell lung cancer.
In this paper, we propose a Semi-supervised Multimodal Multiscale Attention Model (S2MMAM). The overall architecture of the model is divided into two parts: the Supervised Multilevel Fusion Segmentation Network (SMF-SN) and the Semi-supervised Multimodal Fusion Classification Network (S2MF-CN), as shown in Fig 1. In this model, the useful information of CT images is captured by SMF-SN and transferred to S2MF-CN to facilitate the execution of image prediction tasks. The S2MMAM utilizes the fusion of CT images and genetic data to accurately predict whether KRAS is mutated in NSCLC.

Fig 1. Overview of our S2MMAM, including: (a) Supervised Multilevel Fusion Segmentation Network (SMF-SN), whose inputs are CT images and pixel-level mask images and whose outputs are segmented lesion images, (b) Semi-supervised Multimodal Fusion Classification Network (S2MF-CN), and (c) processing of gene data. In the S2MMAM, the useful information of CT images is captured by SMF-SN and transferred to S2MF-CN to facilitate the execution of image prediction tasks. The S2MMAM utilizes the fusion of CT images and genetic data to accurately predict whether KRAS is mutated in NSCLC. https://doi.org/10.1371/journal.pone.0297331.g001

The architecture of S2MF-CN. The proposed S2MF-CN structure is shown in Fig 1(B), which adopts the Mean Teacher model structure as the main framework of the classification network. In Mean Teacher, the Teacher network has the same structure as the Student network. The Student model is the target model to be trained. It assigns the exponential moving average (EMA) of its weights to the Teacher model at each step of training. The predictions of the Teacher model are treated as additional supervision for the learning of the Student model. Our model uses the final Student model to make predictions. The specific Student model to be trained is shown in Fig 5(A) and consists of three parts: the encoder, the decoder, and the Intra and Inter Mutual Guidance Attention Fusion (I2MGAF) module. The encoder and decoder have the same structure and parameters as those of SMF-SN. This allows the network to focus on the lesion region and capture the necessary segmentation features through the encoder and decoder. I2MGAF performs feature purification using mutual guidance attention modules. It is able to fully extract multi-scale lung CT image features and genetic features. It also performs an adaptive fusion of features through an attentional fusion mechanism for KRAS gene mutation prediction in NSCLC. I2MGAF is described in detail in Section Intra and Inter Mutual Guidance Attention Fusion Module.

Intra and inter mutual guidance attention fusion module. In the S2MF-CN network, we propose an I2MGAF module. I2MGAF fully fuses multi-scale image segmentation features, classification features, and genetic features by using the IntraFC component and InterFC component with a dual attention fusion mechanism. Its aim is to improve the classification capability of the classification network.
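The EMA weight transfer from Student to Teacher in the Mean Teacher framework described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function name `ema_update`, the dictionary-of-arrays weight layout, the layer name `"fc"`, and the decay value `alpha` are all assumptions made for the example.

```python
import numpy as np


def ema_update(teacher, student, alpha=0.99):
    """Return updated Teacher weights as an exponential moving average.

    For every parameter: new_teacher = alpha * teacher + (1 - alpha) * student.
    """
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Toy weights with a hypothetical layer name.
student = {"fc": np.array([1.0, 2.0])}
teacher = {"fc": np.array([0.0, 0.0])}

# One training step: the Student has been updated by gradient descent
# (not shown), and the Teacher then tracks it through the EMA.
teacher = ema_update(teacher, student, alpha=0.9)
```

A decay closer to 1 makes the Teacher change more slowly, so its predictions act as a smoothed target for the Student.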

Fig 5. The overview of the Student Module, including (a) the specific implementation details of the Student Model, (b) the Intra fusion component (IntraFC), which aims to fuse classification and segmentation features at different levels, and (c) the Inter fusion component (InterFC), which aims to fuse CT image features and genetic features. https://doi.org/10.1371/journal.pone.0297331.g005

Fig 6 shows a visual comparison of the six classification performance metrics after replacing the IntraFC module in I2MGAF with Addition, Concatenation, and AFF Block, respectively.

Fig 6. Comparison of the classification performance of IntraFC and three models using other fusion methods. https://doi.org/10.1371/journal.pone.0297331.g006

SS-TBN model and 7.21% higher than the DBA model. This is due to our design of a new multimodal fusion module, I2MGAF. I2MGAF guides the fusion of features for multiple tasks and the fusion of multimodal data. It utilizes segmentation features to facilitate the classification task and efficiently extracts important features from different modalities. I2MGAF can compensate for the specific information that is easily overlooked by a single data modality, achieve the complementary effects of multi-modal data, and find the pathogenic features of lesions across multiple dimensions, thus enhancing the classification ability. We also plot the AUC curves of our S2MMAM and the other five models in Fig 8 to demonstrate the classification performance of our S2MMAM more visually.

Fig 9. Comparison of the segmentation results obtained after training with the Baseline strategy and the Baseline+SMF-SN strategy. Baseline: classification task only. Baseline+SMF-SN: classification task and segmentation task. (a) and (b) are wild-type NSCLC. (c) and (d) are mutated NSCLC. The region surrounded by the red line is the ground truth, and the region surrounded by the green line is the segmentation result. https://doi.org/10.1371/journal.pone.0297331.g009

Table 2. Comparison of three commonly used consistency-based semi-supervised methods.
Proposed the Mutual Bootstrapping Deep Convolutional Neural Networks (MB-DCNN) model for simultaneous segmentation and classification of skin lesions.The rough lesion masks generated by the segmentation network in MB-DCNN help the classification network for training.The segmentation and classification networks transfer knowledge to each other in a bootstrap manner and facilitate each other.

Table 5. Comparison of classification performance of UNet, ResNet, ResNeXt, Inception-v3 and SE-ResNeXt on S2MMAM. SE-ResNeXt (Ours) achieved the best results in all six comparative metrics.

Table 6. Comparison of classification performance of TAFA on S2MMAM and four models with different fusion blocks. TAFA (Ours) achieved the best results in all six comparative metrics. https://doi.org/10.1371/journal.pone.0297331.t006